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Abstract 

Comparing or combining results of peptide identification from different search methods with firm 
foundation is impeded by the lack of a universal statistical standard. Providing an E'-value calibra- 
tion protocol, we demonstrated earlier the possiblity to translate either the score or heuristic E-va\ue 
reported by any method into the textbook-defined .E-value, which may serve as the universal statisti- 
cal standard. This protocol, although robust, may lose spectrum-specific statistics and might require 
a new calibration when changes in experimental setup occur. RAId_deNovo circumvents these issues. 
We show, for a class of scoring functions, how RAId_del\lovo uses the respective score histograms 
from scoring all possible de novo peptides to assign accurate, spectrum-specific ^-values, thereby 
creating a calibration-free protocol for accurate significance assignment and for combining search re- 
sults. RAId_del\lovo features four different modes: (i) compute the total number of possible peptides 
for a given molecular mass range, (ii) generate the score histogram given a MS/MS spectrum and a 
scoring function, (iii) reassign E-values for a list of candidate peptides given a MS/MS spectrum and 
the scoring functions chosen, and (iv) perform database searches using user-selected scoring func- 
tions. In modes (iii) and (iv), RAId_deNovo is capable of combining results from different scoring 
functions using spectrum-specific statistics. 

The web link is http://www.ncbi.nlm.nih.gov/CBBresearch/qmbp/raid_denovo/index.html 
Relevant binaries for Linux, Windows, and Mac OS X are available from the same page. 
Keywords: de novo sequencing, database search method, peptide identification, P-value, E- 
value, SEQUEST, Mascot, XlTandem, RAId_DbS 

1 INTRODUCTION 

Gaining popularity in biology over the last decade, mass spectrometry (MS) has become the core 
technology in the field of proteomics. Although this technology holds the promise to identity 
and quantify proteins in complex biological mixtures/samples, such a goal is not yet achieved 
due to the presence of a number of difficulties ranging from experimental designs, experimental 
*To whom correspondence should be addressed: yyu@ncbi.nlm.nih.gov 
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protocol standardization, to data analysisi^. Since this paper is focused on the statistical 
aspect of peptide identifications, we will start with such an example. 

In general, it is rather easy to rank candidate peptides given a tandem mass spectrum. Once 
a scoring function is given to score peptides, qualified database peptides (those within molecular 
mass range and correct enzymatic cleavages) can be ranked based on their scores. However, it 
becomes difficult to rank candidate peptides across all spectra. Although a number of publi- 
cations have proposed different ways tailored to deal with various aspects of this difficulty— >£, 
this problem remains very challenging. Should one take the best candidate peptide per spec- 
trum and then postprocess to globally re-rank those best hits or should one device something 
different to achieve the maximum robustness? Instead of discussing the differences between 
these two main directions, we like to first point out a common theme that is often unnoticed: 
spectrum-specificity. 

1.1 Spectrum Specificity 

Due to the intrinsic experimental variability, difference in the peptide chemistry, peptide-peptide 
interactions, ionization sources, and mass analyzers used, it is natural to expect variations in 
signal to noise ratios in the tandem mass spectra even when each peptide in the mixture has equal 
molar concentration. That said, one anticipates the noise in a mass spectrum to be spectrum- 
specific. In fact, when one picks the best hit out of each spectrum, the spectrum-specificity is 
acknowledged. This is because one has made the choice to take the best candidate per spectrum 
regardless of the occurrences that the best hit in one spectrum might have lower score than the 
second best hit in the other spectrum. In other words, by picking only the best hits one has 
endorsed that the score should not be used as an objective measure of identification confidence 
across all candidate peptides; or more precisely, the meaning of score depends on its context, 
i.e., the spectrum used. 

There exists a different route to acknowledge the concept of spectrum-specificity. That 
is to use spectrum-specific score distribution to assign an .E-value to each candidate peptide 
per spectrum. Although the term spectrum-specific statistics was not explicitly mentioned, the 
proposal of Fenyo and Beavis^ to fit per spectrum the tail of score distribution to an exponential 
represents the first attempt, to the best of our knowledge, in this direction. The concept of 
spectrum-specific statistics was formally introduced by Alves and Yu— . The same group also 
developed RAId_DbS— , so far the only database search tool with theoretically derived spectrum- 
specific score distribution. The importance of spectrum-specific statistics is then emphasized 
through a series of publications^^^. The key point of this type of approaches is to exemplify 
spectrum-specificity via spectrum-specific score statistics. After describing the common theme, 
spectrum specificity, we now turn to features associated with different type of approaches to 
elucidate the usefulness of an even more general statistical framework. 



2 



1.2 Best hit per spectrum versus Accurate E- value 

When using the strategy of keeping only the best hit per spectrum, a global re-ranking among 
those best hits, to decide which best hit to trust over the others, becomes necessary. This is 
usually achieved by using either the original score in conjunction with either false discovery 
rate (FDR) or q-value analysis through introduction of a decoy database, or by using some 
kind of refined score in conjunction with empirical expectation-maximization-based Bayesian 
approach—. This type of strategies, unfortunately, makes assumptions contradicting spectrum- 
specificity, a fundamental fact that was respected when only the best hit per spectrum is retained. 

In the FDR (be it global or local) or q-value type of analyses, one pools together the best 
hits across spectra and order the hits by their scores. This contradicts the idea of picking best 
hit per spectrum, which essentially endorses that the meaning of peptide score is spectrum- 
dependent and can't be used to rank peptides globally across spectra. For the Bayesian type of 
analyses^, one assumes the existence of two score distributions: one for the score of correctly 
identified spectra, in terms of best hit, while the other for the score of incorrectly identified 
spectra. This means that all correctly identified spectra -in terms of best hit- should be ranked 
according to the best hit's refined score, implying that one may use the refined score to assign 
relative identification confidence across spectra. This again contradicts the idea that the meaning 
of peptide score should be spectrum-dependent. Furthermore, one often needs to assume the 
parametric forms of the two distribution functions to perform the expectation maximization 
procedure, which might not be applicable to all scoring functions. 

When the reported spectrum-specific E'-value (assigned to each of the candidate peptides 
per spectrum) is in agreement with its definition, it can serve as an objective measure of iden- 
tification confidence. For a given spectrum and a score threshold, E-v&lue associated with that 
score threshold is defined to be the expected number of false hits that have score better than or 
equal to that threshold. In simple terms, the E-v&lue associated with a candidate peptide may 
be viewed as the number of false hits anticipated, from querying a spectrum, before calling the 
peptide at hand a true hit. However, a previous study— showed that most E- value reporting 
methods investigated report inaccurate E-values. To rectify this problem, we provided a proto- 
col— to calibrate -E-values reported by other search methods, including search tools that don't 
report E- values such as ProblD^ and SEQUEST—. However, the calibration procedure cannot 
restore/recreate spectrum-specificity for methods not reporting E- values or reporting E- values 
that are not obtained via characterizing score histogram per spectrum (spectrum-specific score 
modeling) . 

Nevertheless, spectrum-specific statistics can be obtained provided that one extracts statis- 
tical significance from the score histogram per spectrum—. A recent reimplementatio n 14 i 15 i 16 of 
the SEQUEST XCorr follows exactly this idea. To avoid possible confusion, however, we must 
first note that the p*-value in reference^ is actually E'-value. Authors of reference^ assume 
that the XCorr from every spectrum can be fitted by a stretched exponential without providing, 
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like most other methods, a measure on the agreement between the best fitted parametric form 
and the score distribution per spectrum. To ensure the accuracy of statistics, a measure of the 
goodness of the model— £ is actually necessary even for scoring systems that have theoretically 
characterized distribution. This is because very biased sampling might lead to discrepancy be- 
tween the theoretical distribution and the score distribution, not to mention the discrepancy 
between the fitted parametric form and the score distribution. 

One way to circumvent the aforementioned problem is to apply target-decoy strategy at the 
per spectrum level. This means that one uses the hits from decoy database to estimate the 
identification confidence of peptides from the target database. This approach, unfortunately, is 
not computationally efficient because one will need a decoy database that is much larger than 
the target database in order to have a good estimate of the P-value for each hit in the target 
database. For example, if the number of qualified peptides in decoy is 1, 000 times of that in the 
target, and if a peptide in the target scores between the third and the fourth of the decoy, then 
that peptide will acquire an E- value between 3 x 10 -3 and 4 x 10~ 3 . And if there are target hits 
that score better than the best decoy hit, all one can say is that they all have P-values smaller 
than 10~ 3 . If one keeps increasing the size of the decoy, one will eventually be able to globally 
rank the candidate peptides from all spectra using P-value. However, computational efficiency 
prevents us from using this strategy. 

These aforementioned problems associated with obtaining spectrum-specific statistics can 
be avoided provided that one uses a search method that has a theoretically derived score dis- 
tribution—. However, restricting to methods that have theoretically derived statistics is not 
necessarily the best strategy since each search method does have different strengt h 10 ' 18 . It can 
be advantageous to combine different types of search scores. Therefore, for assigning peptides' 
identification confidence, it is desirable to have a unified framework which we now turn to. 

1.3 Calibration-free de novo Statistics 

Alves and Yu in 2005 proposed— to use the de novo rank as the statistical significance measure. 
The fundamental idea is the follows. For a given MS/MS spectrum a with parent molecular 
mass M and a given mass error tolerance 5, we denote by n(cr, 5) the set of all possible peptides 
subjected to enzymatic cleavage condition in the mass range [M — 5, M. + b\. We also denote by 
A(cr, 8) the set of peptides in the (target) database, subjected to enzymatic cleavage condition, 
in the mass range [M — 5,M + 5]. The following argument is also applicable to the case when one 
wishes to weight each de novo peptide by its elemental composition. This may be used to form 
a de novo background model mimicking the amino acid composition in the target databas e) 19 ' 20 . 

Let N(S, a) be the (weighted) number of peptides out of n(cr, 8) that have scores higher than 
or equal to S. We then define the de novo P-value corresponding to score S by N(S, a)/\U(a, S)\, 
with |n(<r, 5)\ representing the total (weighted) number of peptides in the set n(cr, 6). In general, 
for a given spectrum a and a score cutoff S, the P-value P(S\a) refers to the probability for a 
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qualified random peptide to reach a score better than or equal to S when using spectrum a as 
query. If a database contains qualified, unrelated random peptides, one will expect to have 
E(S\a) = NdP(S\a) number of random peptides to have quality score better than or equal to 
S. This expectation value E(S\a) is by definition the E- value associated with score cutoff S. 
The S-value associated with a peptide of score S using de novo P-value will therefore be 

where the spectrum-specific E(S\o~) represents the E- value for a hit with score S when the 
spectrum a is used as the query and |A(<r, 5)\d represents the total number of peptides in the 
set A(<7, 5). When casted in the aspect of per spectrum target-decoy approach, II(<t, 5) \ A(cr, 5) 
represents the largest possible decoy database, which is supposed to provide numerically the 
finest E- values for candidate peptides in the target database. (The symbol \ is called "setminus". 
A \ B can be called A minus B in the set sense or called complement of B provided that set 
A is the largest set considered and every set is a subset of A) Let N'(S\o~) be the (weighted) 
number of peptide hits in the target with score greater than S, the per spectrum target-decoy 
approach will have 

P 'A'- m N(S,a)-N'(S,a) N(S,a) 

E t _ d (b\a) = \A(a,d)\ d —— — — p— « \A{a,d)\ d - 



\U(a,5)\A(a,5)\ 1 v ' na \U(a,5)\ 

where the last result comes from N'(S,a) <C N(S,o~) and |n(cr, 5) \ A(a,6)\ ~ |n(c, S)\ for any 
practical applications. 

For a typical molecular mass of 1500 Dalton (Da) and in the absence of weighting, |II(<7, ±lDa)\ 
5 x 10 15 while |n(<7, ±lDa) \ A(a, ±lDa)\ ss 5 x 10 15 - 3 x 10 3 ss 5 x 10 15 for a typical organis- 
mal database such as that of Homo sapiens. In the presence of peptide weighting, one still has 
\U(a,±lDa)\/\U{a,±lDa)\A(a,±lDa)\ ps 1. Therefore, \U(a,6) \ A(a,5)\ » \U(a,5)\. As for 
N'(S, a) versus N(S, a), by definition N' = for best target hit and N(S, a) typically increases 
much faster than N'(S,a) when S is lowered, thus N'(S,o~) <C N(S,a), a fact also observed in 
reference^. Consequently, N(S,a) — N'(S,a) « N(S,a) is a very good approximation. There- 
fore, the de novo based statistics also serve as the best per spectrum target-decoy statistics. The 
only question now is how does one get the score distribution from all possible peptides? 

It turns out that if the score of a peptide is the sum of local contributions, then it is possible 
to construct the score histogram of all possible peptides via dynamic programmin g 19 ' 20 . When 
there exists intrinsically nonlocal contribution in peptide scoring, it is no longer possible to 
obtain the full histogram by dynamic programming. However, it is still possible to estimate the 
de novo rank via a scaling approach— similar to that used in statistical physics. The key point, as 
will be shown later, is that for the four scoring functions implemented in RAId_deNovo, by using 
the de novo statistics, it is no longer critical to theoretically characterize the score distribution 
obtained from database search. This is because the E-v&lue obtained via RAId_deNovo does 
agree well with the textbook definition. The de novo based statistics employed by RAId_deNovo 
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may be extended to provide robust spectrum-specific statistics for scoring functions that do not 
have theoretically characterized score distributions. One advantage to have a method that can 
provide robust spectrum-specific statistics for different scoring functions is that if the i?-value 
reported by each method agrees with its definition, one can compare and combine search results 
from different search methods— 

1.4 RAIcLdeNovo, its Four modes and Paper organization 

We have implemented in RAId_deNovo four scoring functions: RAId (RAId_DbS— scoring func- 
tion), Hyperscore (~ XITandem— ), XCorr(~ SEQUEST— ), and K-score22>22(~ K-score plug-in 
for X!Tandem). The symbol ~ means "mimicking the scoring function of, that is for a search 
method we do not include in the corresponding scoring function the unpublished heuristics (ex- 
isting in the individual code), but only attempt to retain the published scoring function of that 
method. 

RAId_deNovo provides the users with four different modes to choose from: (i) compute the 
total number of possible peptides (TNPP), (ii) generate score histogram, (iii) reassign E'-values, 
and (iv) database search. In mode (i), by specifying a parent ion molecular mass and the mass 
error tolerance, RAId_deNovo can calculate the number of de novo (all possible) peptides within 
the molecular mass range. In mode (ii), given a spectrum and a user-selected scoring function, 
RAId_deNovo generates for the user the corresponding de novo score histogram resulting from 
scoring all possible peptides. In mode (iii), specifying scoring functions, the user may upload, 
along with the tandem mass spectrum, the output files from Mascot— (.dat), SEQUEST(.out) or 
X!Tandem(.xml) to RAId_deNovo to compute the candidate peptides' de novo .E-values. If the 
user prefers, it is also possible to upload the user's own list of candidate peptides for statistical 
analysis instead of uploading the output files from other search tools. Finally, in mode (iv), 
the user may also select to use RAId_deNovo as a database search tool. In both mode (iii) 
and mode (iv), the user can choose multiple scoring functions and RAId_deNovo will, for each 
candidate peptide, combine the statistical significances respectively reported by the scoring 
functions selected by the users. As we will explain in the Method section, RAId_deNovo can 
also incorporate the post translational modifications (PTM) of amino acids in computing the de 
novo statistics. 

When used as a database search tool (mode (iv)), RAId_deNovo first score all qualified 
database peptides. RAId_deNovo then calculates the de novo score histograms corresponding to 
each of the scoring functions selected and assign each database peptide a list of ^-values, each 
of which corresponds to a scoring function selected. RAId_deNovo then applies the method pro- 
posed earlier— to obtain for each database peptide a combined E- value and rank the candidate 
peptides according to their combined -E-values. Therefore, from the combining search methods 
point of view, RAId_deNovo may be regarded as a sequel to a previously reported approach—. 
The difference lies in that here one no longer needs to calibrate the statistics prior to combine 
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the search results because the spectrum-specific statistics for each scoring function provided is 
now computed on-the-fly. 

We now outline the organization of this paper. In the Fundamental Construction section, we 
will first sketch the dynamical programming algorithm needed to generate the score distribution 
of all possible peptides, describe how to incorporate the four scoring functions since some of 
them are not simple sum of local contributions. Due to their importance, the spectral filtering 
methods associated with different scoring functions will also be illustrated. In the Method 
section, in addition to describe a strategy to include PTMs in the de novo statistics, we will 
review a statistically sound method to combine the search results from using different scoring 
functions. The four different modes one may run RAId_deNovo in will also be discussed. In the 
Results section, we describe several tests performed using various modes of RAId_deNovo, as 
well as the E- value accuracy assessment. The paper is then concluded by the Discussion section 
and all the technical aspects that are not essential in understanding the basic idea of this paper 
are relegated to the Appendices. 

2 FUNDAMENTAL CONSTRUCTION 

2.1 Basic Dynamic Programming Algorithm 

To generate the score histogram of all possible peptides in a speedy manner, RAId_deNovo does 
not score every possible peptide individually. As a matter of fact, it is impossible to score every 
possible peptide individually. For example, consider a typical parent ion molecular mass of 1, 500 
Da, it can be shown that the TNPP within 1 Da of this molecular mass is more than 10 15 . Even 
if one has a simple scoring function and a fast computer that can score one hundred millions 
peptides per second, it will take more than 116 days of computer time to do the score histogram 
for a single spectrum. 

In real application, one needs to analyze a spectrum in a short enough time. How could 
one achieve this? One may use a 1-dimensional (ID) mass grid to encode/score all possible 
peptide s 19 i 20 . At each mass index of the grid, the local score contribution associated with 
all partial peptides reaching that location is computed only once and this information may 
be propagated forward to other mass entries via dynamic programming, making it possible to 
generate the score histogram for all possible peptides without individually scoring all peptides. 
In the score histogram, instead of counting number of peptides associated with a certain score, 
it is also possible to weight each peptide sequence according to its elemental composition. For a 
peptide sequence [a\, 02, ■ ■ ■ , «m] , one may assign it a weight 19 ' 20 p{a\)p{a2) ■ ■ ■ p{clm) with p{ai) 
being the emitting probability of amino acid a, . Figure [1] illustrates along with its caption the 
basic ideas mentioned in this paragraph. 
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2.2 Spectral Filtering 

Before describing the scoring functions, the major component for peptide database search tools, 
we first mention spectral filtering, an often under-emphasized but equally important ingredient. 
Starting with a raw tandem mass spectrum, spectral filtering produces a processed spectrum 
that is used to score candidate peptides in the database. Apparently, information kept in 
the processed spectrum plays an important role in the effectiveness of a tool's performance 
in database searches. Tailored for different scoring functions, different filtering strategies are 
employed by different search tools. In order for RAId_deNovo to capture the essence of a scoring 
function, it is very important for RAId_deNovo to produce, for every input raw spectrum, a 
filtered spectrum that is as close as possible to the one produced by other search tool's filtering 
protocol. For most search tools, the filtering heuristics are not clearly documented. For that 
reason, it becomes necessary to delve into the source code of the search program to find out each 
method's spectral filtering protocol. We are thus limited to search tools whose source programs 
are available or those with filtering strategies clearly documented. 

For RAId score, the spectral filtering strategy was described in the earlier publication 8 . 
For Hyperscore, XCorr, and K-score, the details of spectral filtering will be described in the 
appendix. Since the SEQUEST source code is not available, for XCorr score we attempt to 
replicate the filtering of Crux—, a search method that has been shown to reproduce SEQUEST 
XCorr—. To demonstrate that the filtering strategies extracted are accurate, we plot in Figure [2] 
the spectral correlation histograms between the filtered spectra produced by RAId_deNovo's 
Hyperscore/XCorr /K-score with the filtered spectra from X!Tandem/Crux/X!Tandem(with K- 
score plug-in). As shown in these histograms, RAId_deNovo is able to produce identical filtered 
spectra as those generated by the canonical programs. Although the spectral filtering strategies 
associated with search methods investigated seem stable, it is still possible that the developers 
may change their filtering strategies in the future. When that happens, one should be able to 
update RAId_deNovo to reflect the filtering changes provided that the source programs are still 
accessible and clearly documented. 

Instead of elaborating on various filtering strategies, let's first use a real experimental spec- 
trum to demonstrate the effect of spectral filtering employed by different methods. Figure [3] 
shows the raw spectrum, and the filtered spectra processed by the four scoring methods men- 
tioned. The general trend is as follows: RAId score usually produces the filtered spectrum 
that resembles the original spectrum the most; Hyperscore filtering also produces a processed 
spectrum that is similar to the original spectrum; for XCorr and K-score the filtered spectra in 
general look quite different from the original spectrum. The difference in the filtered spectra 
might be a major source contributing to the fact that different search methods have different 
and often complementary strengths. The correlation between any pair of filtering strategies can 
be quantified. Starting with a large set of raw spectra, one may process these spectra with a pair 
of different methods. For each raw spectrum, one obtains two different filtered spectra and can 
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compute their correlation. The correlation between every pair of filtered spectra can be collected 
to form the correlation histogram, reflecting the correlation between a pair of filtering strategies. 
The details of filtering strategies and correlation histograms are provided in the appendices. 



2.3 Scoring Functions 

To better express the scoring functions, let's first define the following notations. For a given 
peptide tt, the set of corresponding theoretical mass over charge (m/z) ratios taken into con- 
sideration by a scoring function is called T(ir), which is also used to indicate the number of 
elements in the set T{n) whenever no confusion arises. The set T(tt) varies from software to 
software. However, the fragmentation series (a n , b n , 6„— 18, b n —l7, c n , x n , y n , y n —18, y n — 17, z n ) cover 
what most methods consider. The Heaviside step function 6{x) is defined by 6{x < 0) = and 
9(x > 0) = 1. We introduce Ii as a short notation for I(rrii), the peak intensity in the processed 
spectrum chosen to associate with theoretical mass rrii. The corresponding mass (from exper- 
imental spectrum) to I{ usually does not coincide with rrii. The absolute difference between 
the chosen mass and the theoretical mass rrii is denoted by Arrij. The notation I[ is used in 
place of I{ when the preprocessing of the spectrum involves a nonlinear transformation of the 
peak intensity or involves generation of additional peaks. We now list the four different scoring 
function implemented: 

-, too 

RAIdS(vr) = J2Hli)e- Ami e(l-A mi ) (1) 



Hyperscore S(tt) = 4 log 
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XCorr S(vr) = ^ £ wrf (3) 

i=l 

q c , , 1000 ln(Z)^ T , us 

K - Score 5(tt) = ^ ^ Wi l' t (4) 

The first scoring function listed is employed by RAId_DbS^; the second one mimicks the 
hyperscore (X//) of XITandem— ; the third one mimicks the XCorr score used in SEQUEST 
and is similar to what was implemented in Cru x 16 ' 15 ; the last one mimicks K-score^, a plug- 
in for XITandem. For the RAId score, the set T(ir) includes only the b- and y-series peaks. 
For the Hyperscore, T(ir) includes {b n , y n }. For XCorr, T(n) includes {b n , y n , b n — 1, b n + 
1, y n — 1, y n + 1, b n — 18, b n — 17, y n — 17, a n } with the corresponding weights given by 
{1, 1, 0.5, 0.5, 0.5, 0.5, 0.2, 0.2, 0.2, 0.2}. As for K-score, T(tt) includes {b n , y n , b n -l, b n + 
1, y n — 1, y n + 1} with the corresponding weights given by {1, 1, 0.5, 0.5, 0.5, 0.5}. 

Very often it is useful to include the peptide length in the scoring of a peptide. Using RAId 
score as a simple example, two peptides of length 11 and 16 may achieve the same raw score — 



9 



S[q = 10, sum of the logarithm of evidence peak intensity. A longer peptide consists of a longer 
list of theoretical peaks to look for and may thus score higher by chance. RAId_DbS scoring 
function— deals with this issue by dividing the raw score by the length of the theoretical peak list. 
Upon doing so, one has S n = S' ll /{2 x (11-1)) = 1/2 and S m = S' 16 /(2 x (16-1)) = 1/3. This 
score normalization may help in discriminating true positives from false positives. The other 
scoring function utilizes the peptide length information is the K-score. Hyperscore employed 
by XITandem uses a slight different score renormalization strategy. Inside the logarithm, the 
Hyperscore contains two factorials, b\ and yl, where for each candidate peptide b (y) represents 
the total number of 6-series (y-series) evidence peaks found in the spectrum. Either the length 
of a peptide or the factorial function is apparently not sum of local contributions. Therefore, 
one needs to extend the basic algorithm outlined in the previous subsection to accommodate 
these additional information needed for scoring. 

As documented in reference^, it is possible to introduce additional structures in the score 
histogram associated with each mass grid. The flexibility to introduce additional structures of 
various dimensions makes RAId_deNovo a versatile tool: it can incorporate the scoring functions 
that utilize length information or number of 6-series (y-series) peaks to compute the final peptide 
score. Using peptide length as an example, Figure H] demonstrates the inclusion of additional 
structures. More detailed exposition about the inclusion of internal structures can be found in 
reference^. 

Although the spectral filtering parts of various scoring functions are replicated exactly, a 
candidate peptide may receive different scores from RAId_deNovo and the original programs. 
This phenomenon can be seen from Figure [5] (Figure [6]) : the ordinate of each data point displays 
the search score of the best hit of a centroid (profile) spectrum using the original programs, 
while the abscissa of the same data point shows the score reported by RAId_deNovo. 

The major source of score difference is due to RAId_deNovo's omission of heuristics while 
implementing a published scoring function. For each scoring function, many scoring heuristics 
are present in the source code. While some of the heuristics cannot be included via dynamical 
programming, all these heuristics are either not described or not justified in the original papers. 
For these reasons, RAId_deNovo does not include those unpublished heuristics. Therefore, the 
Hyperscore/XCorr /K-score scoring functions implemented in RAId_deNovo should be regarded 
as our attempt to mimick the original Hypersocre/XCorr/K-score scoring functions. Although 
the scoring functions we implemented are not exact replicate of the original ones, due to omission 
of heuristics, we can see from Figure [5] and [6] that there exist strong correlation between the 
scoring function implemented in RAId_deNovo and the original scoring function. In other words, 
the scoring functions implemented in RAId_deNovo do capture the essence of these original 
scoring functions. 
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3 METHOD 



We now describe the main feature of RAIcLdeNovo: the capability of combining peptide iden- 
tification results from multiple scoring functions based on spectrum-specific de novo statistics. 
As emphasized in an earlier publication—, the key to successfully combine search results from 
different search method is to have a universal statistical standard such as using accurate E- value. 
Since not many search methods report accurate .E-values^ in agreement with the textbook def- 
inition, a statistical calibration protocol was proposed— to turn scores /E- values into accurate 
i?-values. However, when the experimental protocols are modified, a re-calibration of statistics 
may be necessary—. Although limited thus far to the four scoring functions provided, the merit 
of RAId_deNovo is now clear. One no longer needs to calibrate the statistics prior to combining 
the search results because the spectrum-specific statistics for each scoring function provided is 
computed on-the-fly. In the following subsections, we will first describe in more details regard- 
ing how to extract P-values or ^-values from de novo score statistics. The inclusion of PTM 
amino acids in the de novo statistics is described next. We then review a previously published 
method to properly combine search results, followed by the illustration of various other features 
of RAId_deNovo. 

3.1 De novo Statistics: practical implementation 

In section 1.3, we have described the theoretical idea of how to use de novo statistics to obtain P- 
values and E- values with or without weighting each de novo peptide by its elemental composition. 
In this subsection, we will complement the theoretical idea by providing some pragmatic parts 
of the implementation. 

In order to build the score histogram fast, it is necessary to discretize the score thereby 
compromising to some degree the score precision. However, this rounding of scores does not 
affect peptide scoring when using RAId_deNovo as a database search tool or a tool to provide 
statistical significance for a list of peptides. Specifically, the evidence score collected at each 
mass grid is stored in two formats: one with much higher precision and the other rounded to 
nearest integer. The rounded values are used in dynamical programming to propagate the score 
histogram forward, facilitating a speedy construction of the score histogram. The slight error 
introduced in individual peptide scoring does not influence the accuracy of the score histogram 
much since these errors largely cancel each other when lumping the scores into a histogram. 
In the database search mode, RAId_deNovo will sum the high precision evidence scores in the 
mass grids traversed by the candidate peptide being scored. Therefore the score associated with 
each candidate peptide in the database search mode has a better resolution than that in the 
score histogram. To obtain the statistical significance associated with each candidate peptide, 
RAId_deNovo performs an interpolation procedure to obtain the P-value, 
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Multiplying the P-value by the number of qualified peptides |A(<r, <5)|d in the target database 
provides the E- value 

E(S,a) = \A(a,6)\ d P(S,a) . 
3.2 De novo Statistics including PTM amino acids 

Since proteins do contain PTM amino acids, it is important for peptide identification tools to 
consider amino acid modifications in the statistical analysis. By scoring only qualified peptides, 
database search methods has little problem of including PTM amino acids provided that the score 
distribution is theoretically characterizable. For de novo based statistics, however, additional 
care must be taken to include PTM in the statistics. 

When used as a database search tool, RAId_deNovo needs to assign statistical significances 
to candidate peptides, many of which may contain PTM residues. In order to apply de novo 
statistics here, the most important question is the estimate of the emission probabilities, needed 
for score histogram construction, associated with PTM residues. RAId_deNovo deals with this 
problem in a simple manner. Given a list of peptides, for each amino acid B RAId_deNovo 
first count the number of unmodified amino acids n{B) and n(Pj), the number of amino acid 
B modified to different form Bi with i = 1, . . . , k. RAId_deNovo then proportionally distribute 
the emission probability po(B) associated with amino acid B to all the possible modified forms 
using the following formula 

n(B) + 1 
) +T+Eii n 
n(Bj) 

n(B) + l + ELin(B k ) 

Effectively, one pseudocount is always given to each unmodified amino acid. 

Therefore, for a given list of peptides, RAId_deNovo will count the total number of distinct 
amino acids modifications. In principle, RAId_deNovo can incorporate all those modified amino 
acids in the de novo score histogram construction. However, to maintain a speedy score his- 
togram construction, RAId_deNovo only retain up to ten most abundant PTMs in calculating 
the new emission probabilities. This new set of normalized background frequency (with most 
abundant PTM included) may then be fed into our RAId_deNovo to compute the correspond- 
ing de novo score histogram. The histogram obtained is then used to calculate the statistical 
significance of each reported peptide. 

Although rare PTMs in the peptide list might be omitted in constructing the de novo score 
histogram, the impact to the statistical significance accuracy is minute. For if one were to 
include those PTMs, due to their small normalized emission probabilities, peptides containing 
those PTMs will be weighted substantially smaller than others and thus not significantly af- 
fecting the shape of the score histogram. As for the emission probability po(B) associated with 
amino acid B (regardless of modifications), see eqs. dMS]), one may use either known amino acid 
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background frequencies such as the Robinson-Robinson— frequencies or can calculate the num- 
ber of occurrences of all amino acids in a molecular-mass-specific and database- specific manner. 
The former approach is adapted by RAId_deNovo when the number of peptides (provided by 
the user or extracted from the database) is less than 2, 000; otherwise, the latter approach is 
employed. There exists, of course, room for improvement in terms of including PTMs in the de 
novo statistics. Different alternatives are currently under investigations. 

3.3 Combining Search Results from Different Scoring Functions 

In mode (iii) and mode (iv) when the user select multiple scoring functions, RAId_deNovo is able 
to combine statistical significances reported by different scoring functions. For mode (iv), that 
is, database search, the protocol to combine search results is identical to what was addressed 
before^. In this section, we will briefly review this method. 

For a given spectrum a, to combine search results from m scoring functions (say scoring 
function A\, . . ., A m ), we first construct a union peptide list C{a) = Ca x (o-) U . . . U £yi m (o"), 
where £^(0") is the reported list of peptide hits by method Ai for spectrum a. A peptide in 
the union list has at least one, and may have up to m P-values derived from de novo P-values, 
depending on how many scoring functions reported that specific peptide in their candidate lists. 
Each of the .©-values associated with a peptide will be first transformed into a database P- 
value^. If one were to assume that the occurrence of a high-scoring random hit is a rare event 
and thus can be modeled by a Poisson process with expected number of occurrence E(S\a), one 
may then define another P-value, which is called the database P-value, via 

P db (S\a) = l-e~ E ^ . (7) 

For a given peptide ir, for method(s) that did not report tt as a candidate, the associated 
database P-value(s) of n from that (those) method(s) is (are) set to one. After this procedure, 
each peptide in the list C(o~) have m database P-values (Pi, P2, . . . , P m ). Let r = YliLi Ph the 
combined P-value is given by— 

fc=o fc " 

Once Pcomb is obtained, we may invert the formula in Eq. ([7]) to get a combined ©-value ©comb 
via 

Eq. ([8]) is applied to obtain the final P-value associated with ir. The final P-value P CO mb( 7 '") 
will then be transformed into a final P-value P C omb( 7r ) via Eq. ([9]). We then use P C omb( 7r ) as the 
final P-value to determine the statistical significance of peptide candidate ir, similar to what is 
used in reference^. 
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Suppose one has obtained a list of candidate peptides from some analysis tools that provides 
only crude statistical significance assignment or no significance assignment at all, it is possible 
to upload this list of peptides along with the spectrum to RAId_deNovo to get a reassignment 
of statistical significance via mode (iii) of RAId_deNovo. The fundamental idea here is to first 
obtain the score histograms corresponding to the list of scoring functions selected. With the 
histograms constructed, one can attain the P-values for any score specified. Therefore, for a 
chosen scoring function and a given list of peptides, RAId_deNovo can provide for each peptide 
a de novo P-value by scoring each peptide and then infer from the normalized score histogram. 

In practical implementation, RAId_deNovo sorted the list of peptides according to their 
molecular masses and identify their corresponding mass indices on the de novo mass grid. Using 
these indices as terminating points, but one at a time, RAId_deNovo constructs score histograms 
assuming that the parent ion weight is given by the mass indices considered. Each peptide in the 
list is then rescored using the user-selected scoring versions implemented in RAId_deNovo and the 
-P-values corresponding to these scoring functions are obtained. If no further information other 
than a flat list of peptides is given, RAId_deNovo will combine these P-values using eq. ([8]) and 
return a combined P-value for each peptide in the list. When the number of qualified database 
peptides is known -which is the case if one directly uploads to RAId_deNovo any of the output 
files of Mascot, SEQUEST, or X!Tandem- RAId_deNovo will first transform the P-values into 
P-values and then into database P-values (eq. ([7])). For each peptide in the list, RAId_deNovo 
will then combine their database P-values using eq. © and then obtain the final P-value via 
eq. ©. 

4 Results 

4.1 E- value Accuracy 

Section 1.3 has demonstrated that statistical significance assignment based on using the de novo 
score histogram is spectrum- specific. However, one must verify the P-value accuracy before 
claiming an accurate spectrum-specific statistics is achieved via de novo statistics. A straight- 
forward way to test P-value accuracy— is to compare the averaged number of false positives (the 
textbook definition) versus reported P-value using a spectral dataset resulting from a known 
mixture. To be specific, one will first eliminate true positives from a database, and then use the 
spectra from a known mixture as queries to look for peptide hits. Since the true positives are 
removed from the database before hand, all the peptide hits are false positives. One then ag- 
gregate all the false positives together -there might be many false positives from one spectrum- 
and then sort them in ascending order of P-value. Let M be the total number of spectra used 
for evaluation, and let Ne<e c be the total number of false positives with P-values smaller than 
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or equal to E c . If the E- values reported are accurate, one expects to see that 

rp N E<E C 

M 

subjected to fluctuations due to finite sampling. 

Figures [7]|9] assess the E- value accuracy using both centroid and profile data with i^-values 
obtained from P- values. Both the centroid data set and profile data set are tryptic and are 
identical to the ones used in reference^. The P-value for a peptide hit is obtained by multi- 
plying that peptide hit's de novo P-value by the number of qualified database peptides with 
similar masses. In terms of enumerating qualified peptides, we employ the RAId_DbS strategy. 
Specifically, we further the qualified into ones with correct and incorrect N-terminal cleavages^ 
and have separate counters for them. If a candidate peptide has correct N-terminal cleavage, the 
factor it multiples will be the total number of database peptides with both correct N-terminal 
cleavages and with masses similar to that of the peptide considered; otherwise, the factor will 
considerably larger since it includes all database peptides with masses similar to that of the 
peptide considered. The protein database used there is the NCBI's nr (same version as in refer- 
ence^) with identical cluster removal procedure^. As shown in Figures 0191 the E- values reported 
by RAId_deNovo using various scoring functions implemented are within a factor of five of the 
textbook definition. For any two scoring functions, if they are independent, one may combine 
the statistics using eqs. (J7H9]) and the combined E- value should also follow the theoretical curves. 

How well the combined E- values reported trace the theoretical line can be used as a measure 
of how independent these two scoring functions are, provided that each scoring function already 
has -E-value reported in agreement with the textbook definition. Similar to reported in refer- 
ence^, the combined ^-values from any two methods in general show a larger deviation from 
the textbook definition. This may be due to correlations between search methods. We are cur- 
rently investigating the possibility of taking into account the search method correlation, which 
is supposed to be spectrum-specific too, while combining the statistics. We will incorporate the 
corrected statistics into RAId_deNovo if the investigation along this direction turns out to be 
fruitful. 

4.2 Combine Database Search Results 

The primary feature of RAId_deNovo is the ability to combine, in a statistically sound way, search 
results from different scoring functions. If the retrieval performance of each scoring function 
implemented is poor, then even if one combines the search results, the final outcome might still 
be poor. Below we assess the retrieval performance of each scoring function implemented using 
the Receiver Operating Characteristic (ROC) curves. 
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4.2.1 First assessment of scoring functions 

Here we investigate the performance of the four implemented scoring functions -RAId score, 
Kscore, XCorr, and Hyperscore- each of which is a standard scoring function, often employed 
with program-specific heuristics, for a known search program. The retrieval efficiency is assessed 
using both profile data (Figure [10| NHLBI data set) as well as the centroid data (Figure [TT1 
ISB data set). Since many search methods report only one or very few candidate peptides per 
spectrum, we also include this type of ROC curves (Figures [T5] and [T5]) where only the best 
hit per spectrum is taken from the search results. The performance of this ad hoc truncation 
apparently leads to better retrieval at small number of false positives, indicating the existence 
of false hits whose evidence peaks are homologous to that of the true positive(s) associated with 
a spectrum. We are currently investigating the impact of the existence of these types of false 
positives on the statistical significance assignment, the results will be reported in a separate 
publication. 

4.2.2 Different ROC analysis 

When the true positive peptides are not known a priori, there exist different strategies in classi- 
fying hits into true or false positives when making a ROC plot. These strategies, unfortunately, 
will make a notable difference in retrieval assessment. For example, in a cell lysate experiment of 
a certain organism, it is customary to estimate the number of false positive hits by introducing 
a decoy database during the data analysis. The main idea there is to first sort the peptide hits 
according to their scores. Then for every hit in the decoy, one assumes that there is a correspond- 
ing false hit in the target and such strategy has been used extensively—. ROC analyses done 
this way generally count false positives, which are highly homologous to the target peptides, 
towards true positives. This has two effects: an overcount of true positives and a undercount of 
false positives. As a consequence, the ROC curves will appear more impressive. To mimick this 
situation, we used BLAST to find in the NCBI's nr database highly homologous proteins to the 
target proteins used in the experiment and include those proteins to our true positive set. This 
strategy produces ROC curves shown in solid curves of Figures [14] and [15] When compared to 
Figure [TUl and Figure [TTT the ROC curves produced by this strategy seem much more impressive. 

Without counting highly homologous proteins as false positives is probably agreeable. How- 
ever, counting those peptides/proteins as true positives is exaggerating. Therefore one may use 
a slightly different strategy: removing from consideration homologous proteins to the target 
proteins, which is also termed as cluster removal strategy—. The dashed curves of Figure [TUl and 
Figure [11] are ROC curves obtained this way. This strategy also induces slightly more impressive 
ROC curves than in Figure [TOl and Figure [TT] Apparently, this indicates the highly homologous 
false positive hits are the ones that degrade the retrieval performance. Thus, it can be useful to 
remove those false positives from consideration. Keeping only the best hit per spectrum turns 
out to be one way to achieve such goal. We are currently investigating a more general approach 



16 



to deal with this issue and will present the results in a separate publication. 
4.2.3 Combining Multiple Scoring Functions 

Since different scoring functions have different spectral filtering strategies, it is often advanta- 
geous to combine the search results from several scoring functions. RAId_deNovo provides a 
simple user interface, allowing users to select several scoring functions at a time. A example 
output when several scoring functions are selected is shown in Table [H 

Figure [16] illustrates the performance when RAId_deNovo combines three different scoring 
functions in its database search mode. When compared with Figure [TU] and Figure QTJ if one 
does not count the ROC curves from keeping only the best hit per spectrum there, the ROC 
curves obtained by combining three randomly chosen scoring functions performs better than 
individual scoring functions. The combination of XCorr and Kscore, however, does not produce 
better results. This may be largely come from their correlations as shown in the spectral filtering 
(Figures H7I and H5]>. 

4.3 Other modes 

Examples of using mode (iv) are already shown above. We demonstrate here other features of 
RAId_deNovo to illustrate its versatility. 

4.3.1 Compute TNPP: mode (i) 

Given a parent ion mass, RAId_deNovo is also able to compute efficiently the TNPP asso- 
ciated with that molecular mass within a user-specified mass error. To obtain TNPP using 
RAId_deNovo, one simply types in the molecular mass of interest and the mass error tolerance 
and then press the "Run" button. When using search methods that do not have a theoretical 
model for the score distribution or when the goodness of the score model— is poor, one may 
wish to use a more conservative statistical significance assignment. In this case, a user may set 
1/TNPP as the lower bound for the best P- value for any given parent ion mass. This may help 
in preventing exaggerated/inappropriate statistical significance assignments. 

The user interface for computing TNPP is self-explanatory. Choosing a specific digesting 
enzyme or considers no enzymatic restriction by choosing "no enzyme" , one simply types in the 
molecular weight of interest in the "Molecular weight" window, specify a different parent ion 
mass error to replace the default value if desired, click on "Run" button and the results will be 
displayed shortly. 

4.3.2 Generate score histogram: mode (ii) 

Extraction of the statistical significance from a score distribution often requires a model, be 
it theoretically derived or empirically assumed, for the score distribution. One may test the 
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robustness of a score model by examining how well the score model can fit the database search 
score histograms. When using search methods that has a score model, one may first test how 
well the same score model apply when dealing with all possible peptides. If the score model 
loses stability, this may indicate that the score model is not robust in general. Given a query 
spectrum and a user-selected scoring function, RAId_deNovo can be used to generate a score 
histogram of all possible peptides under the selected scoring scheme. Using an example spectrum, 
Figure [19] shows score histograms corresponding to the four scoring functions implemented in 
RAId.deNovo. 

4.3.3 Reassign E'-value : mode (iii) 

Statistical significance inference from RAId_deNovo only depends on the total number of qual- 
ified peptides inside the database searched, but not dependent on the peptide content inside 
the database. This is because RAId_deNovo has its statistics founded on the (weighted) score 
histogram obtained from scoring all possible de novo peptides. As a consequence, without going 
through the database search again, RAId_deNovo can be used to reassign statistical significance 
to a collection of candidate peptides. The candidate peptides may come from a flat list provided 
by the user, or they can also come from the output files of various search engines. RAId_deNovo 
allows users to upload the output files from SEQUEST, XlTandem, and Mascot for statistical 
significance reassignment. 

Although the scoring functions similar to XCorr, Kscore and Hyperscore have been imple- 
mented in RAId_deNovo, other search engines' scoring functions might not be suitable for score 
histogram construction using dynamic programming. In this case, the user may wish to exam- 
ine/compare the statistical significance reported by a search engine versus what is reported by 
RAId_deNovo and even combine these reported significances. As an example of this usage and 
to test its performance, we use as queries the 10, 000 profile spectra (NHLBI data set) as well 
as 12,628 centroid spectra (A1-A4 of ISB data set), each produced from a known mixture of 
target proteins. Using Mascot as the search engine, we search in the NCBI's nr database with 
proteins highly homologous to the target proteins removed—. The output files were analyzed 
to produce ROC curves, the black solid curves in Figure [20j We then reanalyze the candidate 
peptides' statistical significance by combining the statistical significance reported by Mascot 
with that reported by RAId_deNovo using one additional scoring function. For both profile and 
centroid spectra, when combined with either the RAId score, Kscore, or XCorr , one may obtain 
a retrieval performance that is comparable or slightly better than that from Mascot alone, see 
Figure [201 

Since all the implemented scoring functions are accessible from RAId_deNovo, one can score 
any new PTM peptide using any of the scoring functions available to RAId_deNovo even when 
the original program does not yet include the PTMs of interest. This way, annotated PTM 
found by RAId_DbS^ may be confirmed with other scoring functions in a natural manner and 
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one may even combine the statistical significance as described below to increase the sensitivity 
in finding annotated PTMs and single amino acid polymorphisms (SAP). 

5 Discussion 

In this section we will discuss another proposed use of the de novo scheme in confidence assign- 
ment, venues for improvement, and future directions. 

When combined with database searches, the score histogram obtained by RAId_deNovo also 
provides two useful quantities. First, it gives us the best de novo score. Although we did not 
pursue this way, it has been advocated that the difference between the optimal de novo score 
and the best database hit score per spectrum may serve as a statistical significance measure for 
the highest-scoring peptide hits found in the database^. Second, the score histogram provides 
us with N s , the (weighted) number of all possible peptides with score better than or equal to 
S. This number N s may also be used in conjunction with the (relative) difference between the 
best de novo score and the best database search score per spectrum while constructing statistical 
significance measures other than E- value. 

By integrating existing annotated information into organismal databases, RAId_DbS is now 
able to incorporate during its data analysis annotated information such as SAP, PTM, and their 
disease associations if they exist^. This feature enables the users to identify/include known 
polymorphisms/modifications in their searches without needing to blindly allow all possible SAPs 
and PTMs first and then post process to look up the literatures/databases for explanations. Since 
all the implemented scoring functions of RAId_deNovo are now within the same framework, we 
can let each plug-in scoring function to incorporate in its scoring the new SAP/PTM peptides. 
This way, annotated SAP/PTM found by RAId_DbS may be confirmed by other implemented 
scoring approach in a natural manner and one may even combine the statistical significance as 
described earlier to increase the sensitivity in finding annotated SAPs/PTMs. 

In the near future, we also plan to include more scoring functions in RAId_deNovo if their 
presence may enhance the retrieval performance without sacrifice statistical accuracy. For exam- 
ple, we will investigate the effect of a new scoring function: compound Poisson. This is a natural 
way to incorporate the intensity information into Poisson count statistics. The other scoring 
approach we will investigate is to deconvolute the peptide length information. The reason for us 
to consider this alternative rises from the observation that many scoring functions introduce dif- 
ferent heuristics to correct for the scores associated with candidate peptides of different lengths. 
The purpose of these peptide length correction factors is to balance the fact that longer peptides 
are likely to find more evidence peaks and thus the collected evidence scores may require some 
length correction in order to make the comparison among peptides of various lengths impartial. 
If we group peptides of the same lengths and obtain statistical significance separately for peptide 
candidates of different lengths, we no longer need to introduce any length correction factor. This 
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approach is not feasible for regular database searches since the sample size of peptides of a fixed 
length may be too small. For our de novo scheme, however, we always have a large number of 
peptides participating in our score histogram even the peptide length is fixed. Therefore, the 
idea of deconvoluting the peptide lengths becomes feasible for RAId_deNovo. The results of 
these new scoring functions will be described in a separate manuscript. 

Appendices 

A: Spectral Filtering and Scoring Function 

The main objective of this appendix is to document what we found from the source codes of 
various search methods about their spectral filtering strategies. Although effort is invested to 
faithfully reproduce these filtering strategies, we have no intent to provide logical explanation 
regarding these filtering methods. Readers interested in obtaining logical explanations of these 
strategies should contact the original code authors. There also exist other heuristics in various 
scoring functions that we chose to ignore. As shown in Figures [TOl and [TT1 and in dashed curves 
of Figures [14] and \TE\ the performance of these scoring functions without heuristics do not suffer 
from poorer retrieval compared to their original implementations with heuristics included. 

Notation 

Before we begin documentation of the filtering strategies associated with different scoring func- 
tions as well as our implementation of these scoring functions, we define a set of notations. 

mw precursor ion molecular weight at charge +1 state 
z peptide/fragment charge state 
m/z mass over charge 

nii molecular mass of fragment i in the MS/MS spectrum 
Ami mass disagreement between the theoretical mass of fragment i and m^. 
Ii peak intensity of an observed 
hw highest observed molecular weight 
lw lowest observed molecular weight 
7r stands for a peptide sequence 
5m MS/MS spectrum m/z accuracy 
Da Dalton 

T(tt) represents the set of theoretical peaks used for scoring 
or the total number of items in that set. 

1(tc) length of peptide tt, the total number of amino acids in peptide tt. 

H molecular weight of a hydrogen atom 
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RAId Score Filtering and Scoring Function 

1) The details of RAId score filtering are explained in RAId_DbS original publication — . 

2) The RAId scoring function (also used in RAId_DbS) is define as 

RAId S(tt) = -f- Y>(Ii) e- Am '6(l - Am,), 

with the default theoretical peaks for scoring T(ir) = {b n ,y n Yn=i * an< ^ w ith l(ir) representing 
the number of amino acids in peptide tt. 

XITandem Filtering and Hyperscore 

1) m/z fragments that are within ±0.95 Da from each other are removed from the MS/MS 
spectrum. When two fragments are within ±0.95 Da of each other the fragment with the 
highest intensity is kept. 

for(i = lw; i < hw; i = 

for(j = lw+1; j < hw; j = 
do 

if ((m 4 - 771,) < 0.95 && Ii < Ij ) rrii = ; h = ; 

2) m/z fragments that are in the mass range (x — 5/z,x + 5/z) are removed from the MS/MS 
spectrum, where x = 1.00727 + (mw - 1.00727) /z. 

3) m/z fragments that are lighter than 150 Da are removed from the MS/MS spectrum. 

4) The filtered spectrum is normalized to have maximum intensity 100 and fragments with 
normalized intensity less than 1 are removed from the spectrum. 

5) Try to determine if the spectrum is purely noise, 
if (z=l || 2=2) x = mw - 600 

else x = mw/z 

if ( Heaviest Filtered Fragment < x) 

exit (spectrum is noisy) 
else if ( Total Number of Filtered Fragments < 5) 

exit (spectrum is noisy) 

6) m/z fragments that are within ±1.5 Da are removed from the spectrum. When two fragments 
are within ±1.5 Da of each other the fragment with the highest intensity is kept, as shown in 
step 1). 
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7) The final filtered spectrum consists at most 50 fragments having the highest intensities. 

8) The molecular weights of the fragments in the filtered spectrum are transformed to inte- 
ger values using the MS/MS spectrum mass accuracy (5m). 

m, = int.[^ + 0.5] 

9) After the transformation above, the left and right mass index to mi are initialized as fol- 
lows. 

for(i=l; i < int [mw /5m~\; i = 
do 

if ( < /;) = J< 

if( I i+ i < h) = h 

Note: To speed up the code in RAId_deNovo implementation the intensity is further scale by 
multiplying it by a factor of 0.1: 1% = 0.1 x /j. 

10) Theoretical fragments chosen for scoring T(ir) = {b n ,yn}n=i F° r a precursor ion with 
charge z = 2, the score is give by: 





/T(tt) 




Hyperscore S(tt) = 41og 10 













The multiplication factor of "10" in the above scoring function is introduced because RAId_deNovo 
scaled the intensity by a factor of 0.1 as mentioned above. To keep RAId_deNovo's run time rea- 
sonable, for parent ion in higher charge state, RAId_deNovo scoring deviates from the XlTandem 
Hyperscore. Basically, counter b(y) totals the number of 6-(y-)type of evidence peaks without 
separating them further into different charge states. 

Crux Filtering and XCorr 

1) The intensities present in the MS/MS spectrum are transformed by taking the square root of 
the original intensities. 

h = yfh 

2) m/z fragments that are in the mass range (x — 15, x + 15) are removed from the MS/MS 
spectrum, where x = (mw + z — 1.0) /z. 

3) m/z fragments that are greater than x are removed from the MS/MS spectrum, where 
x = (mw + z - 1 + 50). 

4) Observed m/z fragments in the MS/MS spectrum are transformed to integer values using a 



22 



mass grid where neighboring points are spaced by 1.0005079 Da. 
x = (mw + z - 1 + 50) 
for(i = 0; i < x; i = s [i] =0 

for(i = 0; i < x; i = 
do 

n%i = int [mi/1 .0005079 + 0.5] 
if(s[mj< /i) s[mj = /.j 

5) The MS/MS spectrum's m/z range is divided into 10 mass regions. The width of the re- 
gion is equal to the heaviest observed fragment molecular weight hw divided by 10. 

mr = int[hw/10] 

The 10 regions are: [0, mr), [mr, 2mr), [2mr, 3mr), . . . , and [9mr, lOmr) 

6) The peak intensities within each region, if m/z fragments exist, are normalized to have 
maximum intensity 50. If no mass fragment is present in a given region, the maximum region 
intensity is set to 0. 

7) The final filtered spectrum is obtained by applying the following operation to the spectrum 
intensities 

x = (mw + z - 1 + 50) 
for( i = 0; i < x; i = i+1 

do for(v = 0, j = i-75; j<i+75; 



Ii = s[i] - v/150 

Note: To speed up the code in RAId_deNovo implementation the intensity is further scale by 
multiplying it by a factor of 0.1: Ii = 0.1 x Ij. 

8) The XCorr score is computed by taking the dot product between the theoretical fragments 
T(ir) and the filtered spectrum Ii. The default series used for scoring is T(tt) = {b n ,y n ,b n — 
H,b n + H, y n -H, y n + H,b n -H20,b n -NH 3 ,y n -NH 3 ,a n } 1 ^}^ 1 , with each series contributing 
to the score respectively weighted by Wi = {1,1,0.5,0.5,0.5,0.5,0.2,0.2,0.2,0.2}. 



Kscore Filtering and Scoring Function 

1) The intensities present in the MS/MS spectrum are transformed by taking the square root of 
the original intensities. 



do if (j >0 kk j < x) v = v + s [j] 



1 T{n) 

XCorr S(tt) = — ^ wJi 
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2) m/z fragments less than x are removed from the MS/MS spectrum, where x = (mw + [z — 
1) * 1.00075) x 2/2 + 10.5. 

3) The observed m/z fragments in the MS /MS spectrum are transformed to integer values 
using a mass grid where neighboring points are spaced by 1.0005 Da. 

for(i = 0; i < (mw+128) ; % = i+1) s[i]=0 
for(i = 0; i < hw; i = 
do 

mi = int {.mj 1.0005 + 0.5] 
if (s[?n ?; ]< s [mj = 

4) The spectrum's m/z range is partitioned into intervals with the number of intervals depending 
on the value of R, see below. 

x = (mw + (z-1) *1. 00075) x 2/z +10.5 
R = min(x,hw+10)-lw 

The number of partitions N(i?) is determined by the condition below 



10 


R > 3000 


9 


R > 2500 


8 


R > 2000 


7 


R > 1500 


6 


R > 1000 


5 


R > 



5) Within each partition the spectrum is scaled such that the maximum peak intensity in each 
interval equals to the maximum intensity in the MS/MS spectrum right after step 1). Peaks 
with intensities that are less than 5 percent of the maximum spectrum intensity are removed. 

6) The spectrum is normalized to a unit vector. 
for(x = 0, i = 0; i < hw; i = i+1) x = x+s [i] x s [i] 
for(i = 0; i< hw; i = i+1) s [i] = s [i] / \fx 

7) The final filtered spectrum is obtained by applying the following transformation to the peak 
intensities 

for(i = 0; i < hw; i = i+1) 

do for(v = 0, j = i-50; j<i+50; 

do if (j > && j < hw) v = v+s [j] 
if (s[i]- v/101 > ) 

h = s[i] - v/101 
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8) The Kscore is computed by taking the dot product between the theoretical fragments T(n) 
and the filtered spectrum 7j. The default fragmentation series used for scoring are T(ir) = 
{b n , y n , b n — H,b n + H, y n — H,y n + H} 1 ^^ 1 , with each series contributing to the score respec- 
tively weighted by u), = {1,1,0.5,0.5,0.5,0.5}. 



B: Correlations among Filtering Strategies 

Although all scoring functions are different, they do score more or less the same fragment series. 
Therefore, the major difference in their strengths must largely come from steps other than the 
final peptide scoring. In this appendix, we investigate the correlations among filtering strategies 
of various scoring functions implemented in RAId_deNovo. 

The correlation between any pair of filtering strategies can be quantified statistically as 
follows. Given a set of raw spectra, one may process these spectra with a pair of different 
strategies. For each raw spectrum, one obtains two different filtered spectra and computes their 
correlation. The correlation between every pair of filtered spectra can be collected to form the 
correlation histogram, reflecting the correlation between a pair of filtering strategies. Figure PTT1 
and Figure [T8l exhibit the correlation histograms between each pair of filtering strategies under 
different data type, centroid (A1-A4 of ISB data set^) and profile (NHLBI data set£). The large 
correlation between XCorr and Kscore may be the cause of their significant scoring correlation 
observed. 
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FIGURES 




Figure 1: De novo mass grid. For illustration purpose, the mass grid is made IDa apart. Each mass grid 
contains a score histogram, with left column indicating score and the right column recording the number 
of partial peptides reaching that mass index with score to its left. The score histogram is obtained using 
a backtracking update rule. For example, at the mass grid 558, the local score contribution from evidence 
peaks in the spectrum is assumed to contribute A amount of score. Looking back to mass grid 501, that 
is 57 Da less than 558 Da, one knows that by growing a glycine residue the partial peptides reaching mass 
index 501 will then advance to grid 558. Similarly, any partial peptides reaching mass index 487 will 
reach mass index 558 by growing an alanine residue. Therefore, at mass index 558 the score histogram is 
the superposition of score histograms associated with the other twenty lighter mass grids corresponding 
respectively to the twenty amino acids. The illustration is drawn as if there arc only two amino acids, 
glycine and alanine. When one weights each peptide by its elemental composition, in the histogram the 
counts next to the score is weighted and no longer an integer. For example, the weighted count n(558) 
at mass grid 558 will be given by n(558) = X)a=i Pa ?^(558 — m a ) where m a is the mass of amino acid a 
rounded to the nearest Da and p a is the emitting probability associated with amino acid a. When one 
suppresses the score and only counts number of partial peptides reaching a certain mass grid, the update 
rule readily provides the total number of peptides within a given mass range. 
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Figure 2: Filtering accuracy assessment. For every raw spectrum, one generates six filtered spectra: 
three associated with Hypcrscorc/XCorr/Kscore implemented in RAId_deNovo and the other three re- 
spectively produced by X!Tandcm/Crux/X!Tandcm(with Kscorc plug-in). The mass fragments of every 
filtered spectrum are then read to a mass grid. The spectrum is then viewed as a vector with non- 
vanishing components only at the component/mass indices populated. One then normalize each filtered 
spectrum into unit length. An inner product of any two filtered spectral vectors represents the correlation 
between them. When the spectral quality does not pass a method-dependent threshold, the correspond- 
ing filtering protocol may turn the raw spectrum into a null spectrum without further searching the 
database. Therefore the total number of spectra passing through the filtering stage might be smaller 
than the total number of raw spectra, which is also reflected in the histograms. Two sets of data are 
used for this evaluation. The centroid data, consisting of 38,424 spectra arc from the ISB data set^. 
The profile data, consisting of 10, 000 spectra, are from the NHLBI data set£. Panel A(D) shows the 
histogram of correlation between the RAId_deNovo Kscore and the X! Tandem Kscore plug- in using cen- 
troid(profile) data. Panel B(E) shows the histogram of correlation between the RAId_dcNovo XCorr and 
the Crux XCorr using centroid (profile) data. Panel C(F) shows the histogram of correlation between the 
RAId_dcNovo Hyperscore and the XITandcm Hypcrscore using centroid(profilc) data. The correlation 
strength being always one means that RAId_deNovo is able to faithfully reproduce the filtering strategies 
originally designed for Hyperscore, XCorr, and Kscore. 
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Figure 3: Example processed spectra by different scoring functions versus the original spectrum. The 
centroid spectrum used has a parent ion mass of 1640.80 Da. In panel (A), the original spectrum is 
displayed; (B) shows the processed spectrum generated by the filtering protocol of RAId.DbS scoring 
function; (C) exhibits the processed spectrum generated by the filtering protocol of Kscore; while (D) 
and (E) correspond respectively to the processed spectra produced by XCorr and Hypcrscore. 



31 



Length 



Length 



Length 



S=50 


5 


6 


• « 




3 




S=ll 


23 


21 


• • 


» • 


5 


S=10 


4 


11 


• < 




17 




L=3 


L=4 


• « 


» • 


L=7 



S=50 


2 


11 


• 


• • 


9 




S=50+ A 


7 


17 


• 1 


t • 


12 










• 


























• 
• 




core 














S=ll 


33 


2 


• 


• • 


10 


CO 


S=ll+ A 


56 


23 


• ) 


» • 


15 


S=10 


6 


7 


• 


• • 


14 




S=10+ A 


10 


18 


• 4 


» • 


31 




1=3 


L=4 


• 


• • 


L=7 






L=4 


L=5 


• 1 


t • 


L=8 




Figure 4: Illustration of de novo mass grid with internal structure. The features mentioned in the caption 
of Figure Q] won't be repeated here. Using the peptide lengths as an example, this figure illustrate the 
possibility of including additional structures in the (raw) score histogram associated with each mass grid. 
The basic idea is that the (weighted) count associated with each (raw) score is further categorized by 
the lengths of partial peptides reaching each mass index. In the end, one will apply the length correction 
factor to the raw score to obtain the real score histogram. Apparently, one may also keep track of number 
of b (y) peaks accumulated within the raw score histogram. Again, the factorial contribution can be added 
in the end prior to the construction of the final score histogram. 



32 




Figure 5: Score correlations. A subset of the ISB centroid data set^ was used to perform this evaluation. 
In terms of a scoring function, when the best hit per spectrum (analyzed using the analysis program that 
such scoring function was originally used for) is also a true positive, that candidate peptide is scored 
again using the corresponding scoring function implemented in RAId_dcNovo. Each true positive best hit 
thus gives rise to two scores and plotted using the following rule: the first score is used as the ordinate 
while the second score (from RAId_dcNovo) is used as the abscissa. Including 500 spectra, panel A is for 
the RAId score. Panel B is for Hyperscore and contains 248 spectra. The result of Kscore is shown in 
panel C with 220 spectra. Shown with 500 spectra, panel D documents the results for XCorr. 
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Figure 6: core correlations. A subset of the NHLBI profile data set£ was used to perform this evaluation. 
In terms of a scoring function, when the best hit per spectrum (analyzed using the analysis program that 
such scoring function was originally used for) is also a true positive, that candidate peptide is scored 
again using the corresponding scoring function implemented in RAId_dcNovo. Each true positive best hit 
thus gives rise to two scores and plotted using the following rule: the first score is used as the ordinate 
while the second score (from RAId_dcNovo) is used as the abscissa. Including 500 spectra, panel A is for 
the RAId score. Panel B is for Hyperscore and contains 495 spectra. The result of Kscore is shown in 
panel C with 310 spectra. Shown with 500 spectra, panel D documents the results for XCorr. 
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Figure 7: E- value accuracy assessment. The agreement between the reported E- value and the 
textbook definition is examined using centroid data (A1-A4 subsets of ISB data set). The 
random database size used is 500 MB. The molecular weight range considered while searching 
the database is [MW - 5, MW + <5]. 
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Figure 8: E-value accuracy assessment. The agreement between the reported E'-value and the 
textbook definition is examined using profile data (the NHLBI data set: 10,000 spectra). The 
random database size used is 500 MB. The molecular weight range considered while searching 
the database is [MW - 5, MW + <5]. 
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Figure 9: E-value accuracy assessment. The agreement between the reported E'-value and the 
textbook definition is examined using profile data (panel (A-B), 10,000 spectra of the NHLBI 
data set) as well as centroid data (panel (C-D), A1-A4 subsets of ISB data set). The NCBI's nr 
(of size 500 MB) database with true positives removed is used for this assessment. The molecular 
weight range considered while searching the database is [MW — 5, MW + S] . 
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Figure 10: ROC curves for the profile data (NHLBI data set£). For each of the four scoring 
functions considered, a set of ROC curves are shown. These ROC curves include the results 
from running the designated program associated with that scoring function, the results from 
running RAId_deNovo in the database search mode, and the results from combining with any 
one of the three other scoring functions. Panel (A) shows the results from RAId score, whose 
designated program is RAId_DbS. Panel (B) displays the results from Kscore, whose designated 
program is X!Tandem. Panel (C) exhibits the results from XCorr, which is mostly employed by 
SEQUEST. Panel (D) presents the results from Hyperscore, whose designated program is also 
X! Tandem. 
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Figure 11: ROC curves for the centroid data (A1-A4 of the ISB data set^). For each of the four 
scoring functions considered, a set of ROC curves are shown. These ROC curves include the 
results from running the designated program associated with that scoring function, the results 
from running RAId_deNovo in the database search mode, and the results from combining with 
one of the three other scoring functions. Panel (A) shows the results from RAId score, whose 
designated program is RAId_DbS. Panel (B) displays the results from Kscore, whose designated 
program is X!Tandem. Panel (C) exhibits the results from XCorr, which is mostly employed by 
SEQUEST. Panel (D) presents the results from Hyperscore, whose designated program is also 
XITandem. Instead of using only XCorr (like RAId_deNovo) , SEQUEST program first selects 
the top 500 candidates using SP score. As shown in panel (C), for centroid data there is an 
advantage to filter candidates with the SP score. However, it is also seen that by combining 
XCorr with either RAId score or Hyperscore, equally good results can be attained without 
introducing the SP score heuristics. 
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Figure 12: ROC curves for the profile data (NHLBI data set£). For each of the four scoring 
functions considered, a set of ROC curves are shown. These ROC curves include in the con- 
sideration only the best hit per spectrum from running the designated program associated with 
that scoring function, the best hit per spectrum from running RAId_deNovo in the database 
search mode, and the best hit per spectrum from combining with any one of the three other 
scoring functions. Panel (A) shows the results from RAId score, whose designated program is 
RAId_DbS. Panel (B) displays the results from Kscore, whose designated program is X!Tandem. 
Panel (C) exhibits the results from XCorr, which is mostly employed by SEQUEST. Panel (D) 
presents the results from Hyperscore, whose designated program is also XITandem. 
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Figure 13: ROC curves for the centroid data (A1-A4 of the ISB data set^). For each of the four 
scoring functions considered, a set of ROC curves are shown. These ROC curves include in the 
consideration only the best hit per spectrum from running the designated program associated 
with that scoring function, the best hit per spectrum from running RAId_deNovo in the database 
search mode, and the best hit per spectrum from combining with any one of the three other 
scoring functions. Panel (A) shows the results from RAId score, whose designated program is 
RAId_DbS. Panel (B) displays the results from Kscore, whose designated program is X!Tandem. 
Panel (C) exhibits the results from XCorr, which is mostly employed by SEQUEST. Panel (D) 
presents the results from Hyperscore, whose designated program is also X!Tandem. Instead of 
using only XCorr (like RAId_deNovo) , SEQUEST program first selects the top 500 candidates 
using SP score. As shown in panel (C), for centroid data there is advantage to filter candidates 
with the SP score. However, it is also seen that by combining XCorr with either RAId score or 
Hyperscore, equally good results can be attained without introducing the SP score heuristics. 
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Figure 14: ROC curves when highly homologous proteins* are also counted as true positive 
proteins. Plots done this way are analogous to the ROC plots obtained from using decoy 
database to estimate the number of false positives. Each panel displays the results of a scoring 
function. The resulting ROC curves from using RAId_deNovo implementation and the original 
implementation in other search program are both shown. The results from profile data (NHLBI 
data set£) are shown in solid curves, while the results from centroid data (A1-A4 of ISB data 
set^) are shown in long-dash curves. Panels (A,B,C,D) respectively display the results from 
using RAId score, Kscore, XCorr, and Hyperscore. Except for RAId score, the RAId_deNovo 
implemented scoring functions performs comparably to the original implementation in other 
search methods. 
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Figure 15: ROC curves when highly homologous proteins* removed from the nr database, thus 
not counted towards true positives or false positives. Each panel displays the results of a scoring 
function. The resulting ROC curves from using RAId_deNovo implementation and the original 
implementation in other search program are both shown. The results from profile data (NHLBI 
data set^) are shown in solid curves, while the results from centroid data (A1-A4 of ISB data 
set^) are shown in long-dash curves. Panels (A,B,C,D) respectively display the results from 
using RAId score, Kscore, XCorr, and Hyperscore. Except for RAId score, the RAId_deNovo 
implemented scoring functions performs comparably to the original implementation in other 
search methods. Solid line curve NHLBI data set, long-dash curve the A1-A4 of ISB data set. 
Highly homologous proteins removed from nr. 
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Figure 16: Illustration of the RAIcLdeNovo performance when combining three different scoring 
functions. Panel (A) shows the results from the profile data (NHLBI data set^), while panel (B) 
exhibits the results from the centroid data (A1-A4 of the ISB data set— £). Panel (C) shows the 
results from the profile data but keeping only best hit per spectrum, while panel (D) exhibits 
the results from the centroid data but keeping only best hit per spectrum. 
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Figure 17: Histograms of correlations between filtering strategies. Used in this plot are 38, 424 
raw centroid spectra from the ISB data set^. Running through the four different filtering 
strategies, each raw spectrum will have four different processed spectra. The mass fragments 
of every filtered spectrum are then read to a mass grid. The spectrum is then viewed as a 
vector with non- vanishing components only at the component /mass indices populated. One then 
normalize each filtered spectrum into unit length. An inner product of any two filtered spectral 
vectors represents the correlation between them. When the spectral quality does not pass a 
method-dependent threshold, the corresponding filtering protocol may turn the raw spectrum 
into a null spectrum without further searching the database. For a given pair of filtering methods 
and a raw spectrum, if each of the two filtering methods produces a nonempty filtered spectrum, 
one may turn those filtered spectra into spectral vectors and compute their inner product, i.e., 
their correlation. For each pair of filtering methods, these inner products are accumulated and 
plotted as a correlation histogram. All six pairwise combinations are shown. 
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Figure 18: Histograms of correlations between filtering strategies. Same as Figure [T7] except 
that the 10, 000 raw spectra used are profile data from the NHLBI data set—. 
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Figure 19: Example score PDF (normalized histogram) output by RAIcLdeNovo. An MS 2 spectrum of 
parent ion mass 1640.80 Da is queried with default parameters. (A) RAId . (B) is Kscore , (C)XCorr, 
(D) Hyperscore. The total number of possible peptides within ± 3Da of parent ion mass is about 10 19 . 
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Figure 20: Example of reanalyzing output files from other search engine by combining with 
statistical significance assignment from RAIcLdeNovo. In this example, we use the Mascot 
output files resulting from querying profile spectra (panel (A), NHLBI data set) and centroid 
spectra (panel (B), A1-A4 of the ISB data set^) to the NCBI's nr database with proteins highly 
homologous to that were present in the mixture removed. Since each data set is from a known 
mixture of proteins, it is possible to remove the homologous proteins with respect to the true 
positives from the nr database. We then combine the calibrated i?-value^ of Mascot with the 
i?-value obtained from RAId_deNovo when either RAId score, Hyperscore, Kscore or XCorr is 
used. 
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Table 1: The table is an example of the combine E-value from RAIcLdeNovo. 



J_j_UU11±U 


R ATH 






IVoLUl t; 




1.35e— 41 




8 26e— 11 




7 QQp— 1 3 


NYQEAKDAFLGSFTYEYSR 


1.43 


379.00 


0.08 


453.00 


101.00 


APTSAGPWEKPTVEEALESGSR 


1.85 


28.50 


1.94 


9.01 


0.15 


LERMTQALALQAGSLEDGGPSR 


3.38 


13.60 


0.30 


88.40 


4.32 


TEDQRPQLDPYQILGPTSSR 


4.04 


15.80 


18.40 


0.38 


18.30 


NYKAKQGGLRFAHLLDQVSR 


8.81 


257.00 


1.48 


1170.00 


1280.00 


DTPMLLYLNTHTALEQMRR 


9.58 


8.76 


1.66 


353.00 


37.20 


EKTESSGQETTAKCDRASKSR 


9.75 


1.71 


8.15 


82.80 


6.99 


LLAQQSLNQQYLNHPPPVSR 


10.80 


358.00 


1.95 


311.00 


269.00 


IQHGQCAYTFILPEHDGNCR 



49 



